The conditional mean will be our first predictive algorithm. Conditional means answer the question: “Given what we know about a certain case, what can expect to see, on average?” The conditional mean is a powerful tool that is typically quite easy to explain to decision-makers.
We’ll go through the following steps:
We will be working with a dataset put together by the census bureau that summarizes the characteristics of the 3,088 counties in the United States.
load("pd.Rdata")
pd
## # A tibble: 3,088 × 55
## fips pop2013 pop2010_base popchange_pc pop2010 popu5 popu18 pop65p
## * <chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl>
## 1 01001 55246 54571 1.2 54571 6.1 25.4 13.5
## 2 01003 195540 182265 7.3 182265 5.7 22.4 18.1
## 3 01005 27076 27457 -1.4 27457 5.8 21.1 15.9
## 4 01007 22512 22919 -1.8 22915 5.3 21.3 14.3
## 5 01009 57872 57322 1.0 57322 6.1 23.8 16.4
## 6 01011 10639 10915 -2.5 10914 6.3 21.0 14.6
## 7 01013 20265 20946 -3.3 20947 6.1 23.3 17.7
## 8 01015 116736 118572 -1.5 118572 5.8 22.2 15.5
## 9 01017 34162 34170 0.0 34215 5.9 21.4 17.8
## 10 01019 26203 25986 0.8 25989 4.8 20.7 20.1
## # ... with 3,078 more rows, and 47 more variables: female_pc <dbl>,
## # white_pc <dbl>, black_pc <dbl>, am_ind_pc <dbl>, asian_pc <dbl>,
## # hawaii_pi_pc <dbl>, twomore_race_pc <dbl>, hispanic_pc <dbl>,
## # white_non_hispanic_pc <dbl>, same_house_pc <dbl>,
## # foreign_born_pc <dbl>, other_eng_home_pc <dbl>, hs_grad_pc <dbl>,
## # coll_grad_pc <dbl>, veterans <int>, travel_time <dbl>,
## # housing_units <int>, homeown_rate <dbl>, house_unit_multi <dbl>,
## # median_home_val <int>, households <int>, person_per_hh <dbl>,
## # per_capita_inc <int>, median_hh_inc <int>,
## # persons_below_poverty <dbl>, pv_nonfarm <int>,
## # pv_nonfarm_employ <int>, pv_nonfarm_employ_ch <dbl>,
## # nonemployer_est <int>, firms <int>, firms_black_own_pc <dbl>,
## # firms_amind_own_pc <dbl>, firms_asian_own_pc <dbl>,
## # firms_hawaii_pi_own_pc <dbl>, firms_hispanic_own_pc <dbl>,
## # firms_female_own_pc <dbl>, manufacture_ship <dbl>, wholesale <dbl>,
## # retail <dbl>, retail_percap <int>, hospitality <int>,
## # bldg_permits <int>, land_area <dbl>, pop_per_square <dbl>,
## # county <chr>, percapinc.2010 <dbl>, percapinc.2012 <dbl>
The codebook for this dataset is stored as another dataset, labels_explain. The first column in this dataset is variable names, the second column is a full explanation of that variable.
## Full explanation of data
load("pd_lab_explain.Rdata")
lab_explain
## # A tibble: 51 × 2
## varname label
## <fctr> <fctr>
## 1 pop2013 Population, 2013 estimate
## 2 pop2010_base Population, 2010 (April 1) estimates base
## 3 popchange_pc Population, percent change - April 1, 2010 to July 1, 2013
## 4 pop2010 Population, 2010
## 5 popu5 Persons under 5 years, percent, 2013
## 6 popu18 Persons under 18 years, percent, 2013
## 7 pop65p Persons 65 years and over, percent, 2013
## 8 female_pc Female persons, percent, 2013
## 9 white_pc White alone, percent, 2013
## 10 black_pc Black or African American alone, percent, 2013
## # ... with 41 more rows
#or use View
#View(lab_explain)
Quick Exercise: Find the per capita income and the percent of the population with a bachelor’s degree for the county you’re from.
pd%>%filter(fips=="06085")%>%select(per_capita_inc,coll_grad_pc)
## # A tibble: 1 × 2
## per_capita_inc coll_grad_pc
## <int> <dbl>
## 1 41041 46
Our working example will be based on predicting income in a given county. Suppose we want to know what income level we can expect for a geographic area based on observed characteristics, such as the proportion of the population with a bachelor’s degree. How would we predict the income based on what we know about the geographic area?
Let’s begin by plotting the data to see what it looks like. To do this I need to first rank the counties by income. To create a rank variable that will be stored in the pd dataset, I use the mutate command. This creates a variable based on some calculation then stores it in the same dataset. I’m then going to plot incomes for each county in descending rank order. Using the plotly library I can make this interactive so we know which counties we’re talking about.
## Create a rank variable for income
pd<-pd%>%mutate(percapinc_rank=rank(percapinc.2010))
## Plot by rank
gg<-ggplot(data=pd , aes(x=percapinc_rank,
y=percapinc.2010,
text=county))
##Add Axis Labels
gg<-gg+xlab("Rank")+ylab("Per Capita Income, 2010")
## Add Points
gg<-gg+geom_point(alpha=.5,size=.5)
gg
## Save for later
gg1<-gg
# Make Interactive plot
gg_p<-ggplotly(gg)
gg_p
If you were asked to predict the income for a given area without any additional information, the likely best guess is the overall average. We’re going to begin with the unconditional mean, or simple average, as our first prediction. We’ll again use the mutate command to plug in a variable that will be the average for every county, and we’ll plot this as a predictor.
Our notation for the unconditional mean as a predictor is:
\[\hat{Y}=\bar{Y} \]
##Unconditional Average
pd%>%summarize(mean_percapinc.2010=mean(percapinc.2010,na.rm=TRUE))
## # A tibble: 1 × 1
## mean_percapinc.2010
## <dbl>
## 1 34002.98
##Unconditional Average as a Predictor
pd<-pd%>%mutate(mean_percapinc.2010=mean(percapinc.2010,na.rm=TRUE))
##Plotting
gg<-ggplot(data=pd,aes(y=percapinc.2010,x=percapinc_rank,color="Actual"))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+geom_point(aes(y=mean_percapinc.2010,x=percapinc_rank,
color="Predicted: Unconditional Mean"),
size=.5)
gg<-gg+xlab("Rank of Per Capita Income")+ylab("Per Capita Income")
gg<-gg+scale_color_manual(name="Type",
values=c("Actual"="black",
"Predicted: Unconditional Mean"="blue")
)
gg<-gg+theme(legend.position="bottom")
gg
##Save for later
gg2<-gg
This is of course a terrible prediction. In the absence of any other information, it’s many times the best we can do, but we really ought to be able to do better.
To understand how far off we are, we need to summarize our errors. We will use different ways of doing this this semester, but let’s start with a very standard one, Root Mean Squared Error, or RMSE. An error term is the vertical distance between each point and its prediction. The RMSE is the square root of the sum of squared errors. (Q:why do we square them?).
\[RMSE(\hat{Y})=\sqrt{ 1/n \sum_{i=1}^n(Y_i-\hat{Y_i})^2} \]
pd<-pd%>%mutate(e1=percapinc.2010-mean_percapinc.2010)
## RMSE
rmse_uncond_mean<-rmse(pd$percapinc.2010,pd$mean_percapinc.2010)
rmse_uncond_mean
## [1] 7817.037
To incorporate additional information into the mean, we need to calculate averages at levels of other predictors. Let’s calculate the average per capita income at different levels of college education. The code below will calculate average income across counties at four different levels of college education– the four quantiles of college education in the dataset.
##Condtional Average across a single variable
## Create a variable for quartiles of college education
pd<-pd%>%mutate(coll_grad_level=ntile(coll_grad_pc,4))
table(pd$coll_grad_level)
##
## 1 2 3 4
## 772 772 772 772
pd<-pd%>%group_by(coll_grad_level)%>% ## Group by predictor
##Calculate mean at each level of predictor
mutate(pred_income_college=mean(percapinc.2010))%>%
## Ungroup
ungroup()%>%
#Rank by prediction, with ties sorted randomly
mutate(pred_income_college_rank=rank(pred_income_college,ties.method="random"))
pd%>%group_by(coll_grad_level)%>% ## Group by predictor
##Calculate mean at each level of predictor
summarise(pred_income_college=mean(percapinc.2010))
## # A tibble: 4 × 2
## coll_grad_level pred_income_college
## <int> <dbl>
## 1 1 28055.60
## 2 2 32371.22
## 3 3 35291.76
## 4 4 40293.34
gg<-ggplot(data=pd,aes(x=pred_income_college_rank,y=percapinc.2010,color="Actual"))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+geom_point(aes(x=pred_income_college_rank,y=pred_income_college,color="Predicted:Conditional Mean, 1 var"))
gg<-gg+ scale_color_manual("Type",values=c("Predicted:Conditional Mean, 1 var"="red","Actual"="black"))
gg<-gg+theme(legend.position="bottom")
gg<-gg+xlab("Rank")+ylab("Per Capita Income, 2010")
gg
##Save for later
gg3<-gg
Our notation for this predictor will be:
\[\hat{Y}=(\bar{Y}|X=x) \]
That is, the predicted value of y, \(\bar{Y}\) is equal to the mean value of \(Y\) given our predictor \(X\) (college graduate levels in this case) is equal to a given value of \(X\), denoted by \(x\).
Quick Exercise: Calculate per capita income as a function of the proportion of the county with a high school education
Let’s see what happened to our RMSE when we did a conditional as opposed to an unconditional mean.
rmse_cond_mean_one<-rmse(pd$percapinc.2010,pd$pred_income_college)
rmse_cond_mean_one
## [1] 6425.743
The next step is to then incorporate more information from additional variables. Let’s calculate the average income by both quartiles of both the population with a bachelor’s degree and median home values.
##Condtional average across multiple variables
## Calculate quartiles of home value
pd<-pd%>%mutate(median_home_level=ntile(median_home_val,4))
table(pd$median_home_level)
##
## 1 2 3 4
## 772 772 772 772
##Make prediction: income by quartiles of education and home values
pd<-pd%>%group_by(coll_grad_level,median_home_level)%>% ## Grouping at multiple levels
mutate(pred_income_college_home=mean(percapinc.2010))%>%
ungroup()%>%
mutate(pred_income_college_home_rank=rank(pred_income_college_home,ties.method="random"))
## Showing the various levels
gg<-ggplot(data=pd,aes(x=pred_income_college_home_rank,y=percapinc.2010,color="Actual"))
gg<-gg+geom_point(alpha=.5,size=.5)
gg<-gg+geom_point(aes(x=pred_income_college_home_rank,
y=pred_income_college_home,
color="Predicted:Conditional Mean, 2 vars")
)
gg<-gg+scale_color_manual("Type",values=c("Actual"="black",
"Predicted:Conditional Mean, 2 vars"="orange"
))
gg<-gg+theme(legend.position="bottom")
gg<-gg+xlab("Rank")+ylab("Per Capita Income")
gg
## Save for later
gg4<-gg
This is clearly much better: our predictions appear to be much closer to the actual data points more of the time.
And what happened to RMSE?
rmse_cond_mean_two<-rmse(pd$percapinc.2010,pd$pred_income_college_home)
rmse_cond_mean_two
## [1] 6250.65
Let’s put all of the plots together to compare.
gg5<-grid.arrange(gg1,gg2,gg3,gg4,nrow=2)
gg5
## TableGrob (2 x 2) "arrange": 4 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
## 3 3 (2-2,1-1) arrange gtable[layout]
## 4 4 (2-2,2-2) arrange gtable[layout]
Quick Exercise: Predict income using two other variables
When might we use the conditional mean?